arXiv:1502.07828v1 [cs.MM] 27 Feb 2015


HYBRID CODING OF VISUAL CONTENT AND LOCAL IMAGE FEATURES 


Luca Baroffio, Matteo Cesana, Alessandro Redondi, Marco Tagliasacchi, Stefano Tubaro 
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano 


ABSTRACT 

Distributed visual analysis applications, such as mobile visual search
or Visual Sensor Networks (VSNs), require the transmission of visual
content over a bandwidth-limited network, from a peripheral node to
a processing unit. Traditionally, a “Compress-Then-Analyze” approach
has been pursued, in which sensing nodes acquire and encode the
pixel-level representation of the visual content, which is subsequently
transmitted to a sink node in order to be processed. This approach
might not represent the most effective solution: several analysis
applications only leverage a compact representation of the content,
so that transmitting the full pixel-level representation results in an
inefficient usage of network resources. Furthermore, coding artifacts
might significantly impact the accuracy of the visual task at hand.
To tackle such limitations, an orthogonal approach named
“Analyze-Then-Compress” has been proposed [1]. According to such a
paradigm, sensing nodes are responsible for the extraction of visual
features, which are encoded and transmitted to a sink node for further
processing. In spite of improved task efficiency, such a paradigm
implies that the central processing node is not able to reconstruct a
pixel-level representation of the visual content. In this paper we
propose an effective compromise between the two paradigms, namely
“Hybrid-Analyze-Then-Compress” (HATC), which aims at jointly encoding
visual content and local image features. Furthermore, we show how a
target tradeoff between image quality and task accuracy might be
achieved by accurately allocating the bitrate to either visual content
or local features.

Index Terms: Local features, BRISK, Image compression, Predictive coding

1. INTRODUCTION 

In the last few years, local features have been effectively exploited in
a number of visual analysis tasks such as augmented reality, object
recognition, content-based retrieval, image registration, etc. They
provide a robust yet concise representation of an image patch that
is invariant to local and global transformations such as illumination
and viewpoint changes. The traditional pipeline for the extraction
of local image features consists of two main stages: i) a keypoint
detector, which aims at identifying salient points within an image, and
ii) a keypoint descriptor, which captures the local information of the
image patch surrounding each keypoint. Traditional algorithms for
keypoint description, such as SIFT [2] and SURF [3], assign to each
salient point a description by means of a set of real-valued elements,
capturing local information based on intensity gradients. More
recently, a novel class of algorithms, namely binary descriptors, has
emerged as an effective, yet computationally efficient, alternative to
SIFT and SURF. Such features usually rely on smoothed pixel
intensities and not on local intensity gradients, vastly improving the
computational efficiency. The BRIEF [4] descriptor consists of a set
of binary values, each obtained by comparing the smoothed intensities
of two pixels, randomly sampled around a keypoint. BRISK [5],
ORB [6] and FREAK [7] refine the process, introducing ad-hoc
designed spatial patterns of pixels to be compared and achieving
rotation invariance. More recently, BAMBOO [8] exploits a pairwise
boosting algorithm to build a discriminative pattern of pairwise
pixel intensity comparisons.

Local features represent a key component of many distributed
visual analysis applications such as Mobile Visual Search, augmented
reality, and Visual Sensor Network applications. Traditionally,
such tasks have been tackled according to a Compress-Then-Analyze
(CTA) approach, in which sensing nodes acquire the content, encode it
resorting to picture or video coding primitives, e.g. JPEG or
H.264/AVC, and transmit it to a central server that extracts
local features and performs a given visual analysis task. According
to CTA, the pixel-level representation of the acquired visual content
is actually sent to the sink node. However, a number of applications
rely only on compact representations of the content, in the form of
local or global features. In this context, CTA might not be the most
efficient approach, since unnecessary and possibly redundant
information is sent over the network. Furthermore, the central
processing node receives and exploits a lossy version of the originally
acquired visual content. Artifacts introduced by coding algorithms may
affect the accuracy of several applications. Several works in the
literature aim at adapting both image and video compression
architectures so that the quality of local features is preserved.

An alternative paradigm, namely Analyze-Then-Compress (ATC),
has been introduced in [1]. Such an approach aims at tackling the
limitations posed by CTA. According to ATC, the sensing nodes
acquire the visual content and extract information in the form of
local or global features, which are encoded and transmitted to a sink
node that performs visual analysis based on such features. Such a
paradigm moves part of the computational complexity from the central
unit to the sensing nodes. To this end, efficient algorithms for visual
feature extraction and coding architectures tailored to global
and local visual features have been proposed. The task efficiency is
improved, since only relevant information is actually transmitted
over the network. Still, the sink node is not able to reconstruct the
original pixel-level representation of the visual content.

In this paper we propose a novel hybrid approach to distributed
visual analysis tasks aimed at overcoming the limitations of both
ATC and CTA. Hybrid-Analyze-Then-Compress (HATC) represents
an efficient solution for the joint coding of both pixel-level and local
feature-level representations. Furthermore, the allocation of the bit
budget to either visual content or image features is thoroughly
investigated.

Moulin et al. addressed the problem of jointly encoding
pixel-level content and global image features such as either
Bag-of-Words histograms or integral channel features in the context of
scene classification or pedestrian detection, respectively. Differently,
we focus on the joint encoding of visual content and local image
features, typically consisting of sets of salient points, along with
their descriptors.

The rest of this paper is organized as follows: Section 2 introduces
the problem, defining tools and objectives; Section 3 describes
the proposed paradigm; Section 4 is devoted to experimental
evaluation. Finally, Section 5 draws conclusions and discusses future
work.

2. PROBLEM STATEMENT 

Let $\mathcal{I}$ denote an image that is acquired by a sensing node.
Such an image is processed in order to extract a set of features
$\mathcal{D}$. To this end, a detector is applied to the image in order
to identify interest points. The number of detected keypoints
$M = |\mathcal{D}|$ depends on both the image content and on the type
and parameters of the adopted detector. Then, a keypoint descriptor is
computed starting from the orientation-compensated patch surrounding
each interest point. Hence, each $d_m \in \mathcal{D}$ is a local
feature that consists of two components: i) a 4-dimensional vector
$c_m = [x_m, y_m, \sigma_m, \theta_m]^T$, indicating the position
$(x_m, y_m)$, the scale $\sigma_m$ of the detected keypoint, and the
orientation angle $\theta_m$ of the image patch; ii) a $D$-dimensional
vector $d_m$, which represents the descriptor associated with the
keypoint $c_m$. According to Analyze-Then-Compress, the set of features
$\mathcal{D}$ is encoded and transmitted to a sink node for further
analysis. On the other hand, Compress-Then-Analyze would require the
acquired image $\mathcal{I}$ to be encoded and transmitted to a central
unit where it is analyzed. In detail, the sink node receives the
bitstream and reconstructs a lossy version $\tilde{\mathcal{I}}$ of the
original image $\mathcal{I}$. Then, similarly to the case of ATC, a set
of local descriptors is extracted and exploited to perform a given
visual analysis task. However, the image coding process introduces
artifacts that may affect the extraction of local features and, as a
consequence, the task accuracy.

We propose an alternative approach, namely Hybrid-Analyze-Then-Compress,
that aims at efficiently coding both pixel-domain and feature-domain
representations of the visual content. In particular, according to
such a paradigm, the decoder is capable of reconstructing both a lossy
representation $\tilde{\mathcal{I}}$ of the original image (encoded
with $R_{\mathcal{I}}$ bits) and a subset $\mathcal{D}_{HATC}$ of the
original features (encoded with $R_{\mathcal{D}_{HATC}}$ bits), thus
requiring $R_{HATC} = R_{\mathcal{I}} + R_{\mathcal{D}_{HATC}}$ bits in
total.

The HATC approach is generally applicable to any kind of local
feature. In this paper, we focus on the case in which binary
descriptors are used, i.e., $d_m \in \{0,1\}^D$. Each descriptor element
is a bit, representing the result of a pairwise comparison of smoothed
pixel intensities sampled from an ad-hoc designed pattern around a given
interest point. In particular, we consider BRISK [5] binary features.
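
As an illustration of how a binary descriptor of this kind is formed, the following Python sketch compares smoothed intensities at pairs of sampling points around a keypoint. It is a toy example and not the BRISK algorithm: the 3x3 box smoothing, the offsets in `PAIRS` and the function names are assumptions made only for illustration (BRISK relies on its own sampling pattern, smoothing and orientation compensation).

```python
def box_smooth(img, x, y):
    """Mean intensity of the 3x3 neighbourhood around (x, y), clipped at borders."""
    h, w = len(img), len(img[0])
    vals = [img[j][i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))]
    return sum(vals) / len(vals)

# Hypothetical sampling pattern: pairs of (dx, dy) offsets around the keypoint.
PAIRS = [((-2, 0), (2, 0)), ((0, -2), (0, 2)),
         ((-2, -2), (2, 2)), ((-2, 2), (2, -2))]

def binary_descriptor(img, kx, ky, pairs=PAIRS):
    """One bit per pair: 1 if the first smoothed sample is brighter than the second."""
    bits = []
    for (ax, ay), (bx, by) in pairs:
        a = box_smooth(img, kx + ax, ky + ay)
        b = box_smooth(img, kx + bx, ky + by)
        bits.append(1 if a > b else 0)
    return bits

# Synthetic 7x7 image whose intensity decreases from left to right.
image = [[10 * (6 - x) for x in range(7)] for _ in range(7)]
desc = binary_descriptor(image, 3, 3)   # -> [1, 0, 1, 1]
```

Each dexel depends only on the sign of an intensity difference, which is why such descriptors are cheap to compute and can be compared with bitwise operations.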

3. HATC CODING ARCHITECTURE 

Figure 1 illustrates the pipeline of the HATC coding architecture.
As regards the coding of the pixel-level representation of the visual
content, HATC is equivalent to the CTA approach. That is, the acquired
image is encoded and sent to the sink node. Here, the bitstream is
decoded and a lossy representation $\tilde{\mathcal{I}}$ of the image
is reconstructed. CTA would run a detector and a descriptor algorithm
on $\tilde{\mathcal{I}}$, obtaining visual features whose effectiveness
is possibly impaired by the image coding artifacts. The key idea behind
HATC is to add an enhancement layer that allows the central processing
node to reconstruct a subset $\mathcal{D}_{HATC}$ of the original local
descriptors $\mathcal{D}$. Such an approach allows for the refinement of
an arbitrarily-sized subset of features extracted from lossy pixel-level
content, yielding a tradeoff between bitrate and task accuracy. The
higher the number $Z$ of features that are refined, the higher the
resulting bitrate and the higher the accuracy of the visual analysis
task to be performed.

To construct the feature enhancement layer, the sensing node
extracts a set of interest points $\mathcal{K}$ from the acquired image
$\mathcal{I}$. The sensing node computes the sets of descriptors
$\mathcal{D}$ and $\tilde{\mathcal{D}}$ from the original image
$\mathcal{I}$ and from its lossy version $\tilde{\mathcal{I}}$,
respectively. Both sets of descriptors are computed at the locations
defined by the set $\mathcal{K}$. Finally, a subset $\mathcal{D}_{HATC}$
of the set of original descriptors $\mathcal{D}$ is differentially
encoded with respect to the set of lossy descriptors
$\tilde{\mathcal{D}}$.

At the central processing node, a lossy representation
$\tilde{\mathcal{I}}$ of the original image is decoded, along with the
set of keypoint locations $\mathcal{K}$. The set of descriptors
$\tilde{\mathcal{D}}$ is computed exploiting the lossy coded image
$\tilde{\mathcal{I}}$, at the locations defined by $\mathcal{K}$.
Finally, the bitstream related to the enhancement layer
$\mathcal{D}^*_{HATC}$ is decoded and exploited in order to
reconstruct the subset $\mathcal{D}_{HATC}$ of the original descriptors
$\mathcal{D}$.
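
The encoder and decoder steps described above can be summarised in a short Python sketch. Descriptors are represented as integers whose bits are the dexels. The descriptor values, the function names `encode_enhancement` and `decode_enhancement`, and the choice of refining the first `z` descriptors are illustrative assumptions; entropy coding of the residuals and the coding of the image and keypoint locations are omitted.

```python
def encode_enhancement(orig_desc, lossy_desc, z):
    """Sensing node: XOR residuals between original and lossy descriptors
    for the first z keypoints (the subset to be refined)."""
    return [o ^ l for o, l in zip(orig_desc[:z], lossy_desc[:z])]

def decode_enhancement(lossy_desc, residuals):
    """Sink node: refine the first len(residuals) lossy descriptors and
    leave the remaining ones as extracted from the decoded image."""
    refined = [l ^ r for l, r in zip(lossy_desc, residuals)]
    return refined + lossy_desc[len(residuals):]

# Hypothetical 8-bit descriptors computed at the same keypoint locations.
original = [0b10110010, 0b01101100, 0b11110000]   # from the original image
lossy    = [0b10100011, 0b01101100, 0b11010001]   # from the decoded image

residuals = encode_enhancement(original, lossy, z=2)
recovered = decode_enhancement(lossy, residuals)
```

In this toy run the first two descriptors are recovered exactly, while the third remains the one computed from the lossy image. This mirrors the role of the number of refined features in trading bitrate for accuracy.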

The HATC paradigm requires three main components to be encoded
and transmitted to the central node:

• $\mathcal{I}^*$, i.e., the bitstream needed to reconstruct a lossy
representation $\tilde{\mathcal{I}}$ of the original image $\mathcal{I}$;

• $\mathcal{K}^*$, i.e., the bitstream needed to reconstruct the
location of the keypoints extracted from the original image
$\mathcal{I}$;

• $\mathcal{D}^*_{HATC}$, i.e., the bitstream needed to reconstruct the
feature enhancement layer $\mathcal{D}_{HATC}$.

In summary, HATC offers advantages with respect to both ATC
and CTA. First, differently from ATC, the central unit is capable
of reconstructing the pixel-level visual content. Second, differently
from CTA, HATC allows the sink node to operate on high quality
visual features, yielding a higher task accuracy.

3.1. Differential coding of binary local features 

For HATC to be competitive with other approaches, an effective
ad-hoc coding architecture has to be developed. Consider the sets of
descriptors $\mathcal{D}$ and $\tilde{\mathcal{D}}$, extracted from an
input image $\mathcal{I}$ and its lossy counterpart
$\tilde{\mathcal{I}}$, respectively. The proposed differential coding
architecture aims at efficiently encoding the descriptors
$\mathcal{D}_{HATC}$, exploiting $\tilde{\mathcal{D}}$ as a predictor.
The key tenet behind HATC is that the two sets of descriptors,
extracted at a common set of interest point locations, are correlated.
In a sense, such a scenario is similar to that of features extracted
from contiguous frames of a video sequence. In that case, inter-frame
predictive coding can be exploited to improve coding efficiency,
reducing the output bitrate.

In the case of HATC, given a binary descriptor $d_m \in \mathcal{D}$
and its counterpart $\tilde{d}_m \in \tilde{\mathcal{D}}$ extracted from
the original and the decoded images, respectively, the prediction
residual can be computed as

$$ c_m = d_m \oplus \tilde{d}_m, \qquad (1) $$

that is, the bitwise XOR between $d_m$ and $\tilde{d}_m$.

In binary descriptors, each element represents the binary outcome
of a pairwise comparison between smoothed pixel intensities.
Hence, the dexels (descriptor elements) are potentially statistically
dependent, and so are the elements of the prediction residual $c_m$.
In this context, it is possible to model the prediction residual as a
binary source with memory. Let $\pi_j$, $j \in [1, D]$, represent the
$j$-th element of a prediction residual, where $D$ is the dimension of
such a descriptor. The entropy of such an element can be computed as

$$ H(\pi_j) = -p_j(0)\log_2 p_j(0) - p_j(1)\log_2 p_j(1), \qquad (2) $$

where $p_j(0)$ and $p_j(1)$ are the probabilities of $\pi_j = 0$ and
$\pi_j = 1$, respectively. Similarly, the conditional entropy of element
$\pi_{j_1}$ given element $\pi_{j_2}$ can be computed as

$$ H(\pi_{j_1} | \pi_{j_2}) = \sum_{x \in \{0,1\},\, y \in \{0,1\}} p_{j_1,j_2}(x, y) \log_2 \frac{p_{j_2}(y)}{p_{j_1,j_2}(x, y)}, \qquad (3) $$

with $j_1, j_2 \in [1, D]$. Let $\tilde{\pi}_j$, $j = 1, \ldots, D$,
denote a permutation of the prediction residual elements, indicating
the sequential order used to encode a descriptor. The average code
length needed to encode a descriptor is lower bounded by

$$ R = \sum_{j=1}^{D} H(\tilde{\pi}_j | \tilde{\pi}_{j-1}, \ldots, \tilde{\pi}_1). \qquad (4) $$

In order to maximize the coding efficiency, we aim at finding the
permutation of elements $\tilde{\pi}_1, \ldots, \tilde{\pi}_D$ that
minimizes such a lower bound. For the sake of simplicity, we model the
source as a first-order Markov source. That is, we impose
$H(\tilde{\pi}_j | \tilde{\pi}_{j-1}, \ldots, \tilde{\pi}_1) = H(\tilde{\pi}_j | \tilde{\pi}_{j-1})$.
Then, we adopt the following greedy strategy to reorder the elements of
the prediction residual:

$$ \tilde{\pi}_j = \begin{cases} \arg\min_{\pi_j} H(\pi_j) & j = 1 \\ \arg\min_{\pi_j} H(\pi_j | \tilde{\pi}_{j-1}) & j \in [2, D] \end{cases} \qquad (5) $$

Note that such optimal ordering is computed offline, thanks to a
training phase, and shared between both the encoder and the decoder.

3.2. Coding of keypoint locations

Consider an $N_x \times N_y$ image $\mathcal{I}$. The coordinates of
each keypoint $c_m \in \mathcal{K}$ (at quarter-pel accuracy) are
encoded using $R^c_n = M_n (\log_2 4N_x + \log_2 4N_y + S)$ bits, where
$S$ is the number of bits used to encode the scale parameter. Higher
coding efficiency is achievable implementing ad-hoc lossless or lossy
coding schemes to compress the coordinates of the keypoints.

Fig. 1. Block diagram of a) HATC joint feature-image encoder; b) HATC joint feature-image decoder.

4. EXPERIMENTS

The effectiveness of the proposed paradigm has been evaluated and
compared with that of both Compress-Then-Analyze and
Analyze-Then-Compress, with respect to a content-based image retrieval
application.

4.1. Datasets

We exploit the publicly available Zurich building dataset (ZuBuD)
in order to evaluate the performance of HATC. Such a dataset consists
of 1005 pictures representing 201 different Zurich buildings (5
different views for each object). A test set composed of 115 image
queries, each one capturing a different building, is also provided.
Database and query images have heterogeneous resolutions and
imaging conditions. As regards the training phase, 1000 images
have been randomly sampled from the MIRFLICKR dataset
and they have been exploited to compute the coding-wise optimal
dexel order and the associated coding probabilities, as illustrated in
Section 3.
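
The offline training step that derives the dexel coding order (Section 3.1) can be sketched in Python as follows. The sketch estimates probabilities by counting over a set of training residuals and then applies the greedy rule: it starts from the element with the lowest entropy, then repeatedly appends the not-yet-selected element with the lowest conditional entropy given the previously selected one. The function names and the tiny training set are assumptions made for illustration. Restricting each step to the remaining elements is how the sketch enforces that the result is a permutation.

```python
from math import log2

def entropy(residuals, j):
    """Empirical entropy H(pi_j) of element j over the training residuals."""
    p1 = sum(r[j] for r in residuals) / len(residuals)
    return -sum(p * log2(p) for p in (p1, 1.0 - p1) if p > 0)

def cond_entropy(residuals, j1, j2):
    """Empirical conditional entropy H(pi_j1 | pi_j2)."""
    n = len(residuals)
    h = 0.0
    for y in (0, 1):
        n_y = sum(1 for r in residuals if r[j2] == y)
        for x in (0, 1):
            n_xy = sum(1 for r in residuals if r[j1] == x and r[j2] == y)
            if n_xy > 0:
                # p(x, y) * log2(p(y) / p(x, y))
                h += (n_xy / n) * log2(n_y / n_xy)
    return h

def greedy_order(residuals):
    """Greedy dexel ordering under a first-order Markov model."""
    remaining = set(range(len(residuals[0])))
    first = min(sorted(remaining), key=lambda j: entropy(residuals, j))
    order = [first]
    remaining.remove(first)
    while remaining:
        prev = order[-1]
        nxt = min(sorted(remaining),
                  key=lambda j: cond_entropy(residuals, j, prev))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy training residuals (one list of bits per descriptor); values are made up.
train = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
order = greedy_order(train)   # -> [2, 0, 3, 1]
```

In this toy set, element 2 is constant and is therefore coded first, while element 3 duplicates element 0 and is placed right after it, where its conditional entropy is zero. Ties are broken by the lowest index, which is a choice of this sketch.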


4.2. Methods 

We compared the performance of the following paradigms: 

• Compress-Then-Analyze (CTA): each query picture is encoded
resorting to JPEG. Subsequently, BRISK local features
are extracted from the lossy compressed image and exploited
for the retrieval pipeline;

• Analyze-Then-Compress (ATC): each query picture is processed
in order to extract a set of BRISK features, which are
encoded resorting to a previously proposed coding architecture
and exploited for the retrieval pipeline;

• Hybrid-Analyze-Then-Compress (HATC): a local feature
enhancement layer, composed of a subset of the BRISK features
extracted from the uncompressed image, is generated and
differentially encoded according to the procedure presented
in Section 3. Such features are exploited for the retrieval
pipeline.

4.3. Parameter settings 

As for CTA, we define a set of possible values for the JPEG
quality factor $Q = \{5, 10, 15, 20, 50, 70\}$ in order to generate
a rate-accuracy curve. As to ATC, a similar rate-accuracy curve
is obtained by imposing different BRISK detection thresholds
$t_{BRISK} = \{70, 75, 80, 85, 90, 95, 100, 105\}$. Finally, as to HATC,
for each JPEG quality factor, a rate-accuracy curve is obtained by
setting the number $Z = \{25, 50, 100, 150\}$ of features to be refined
resorting to a feature enhancement layer, as reported in Section 3.

Fig. 2. Feature coding efficiency as a function of the distortion
(PSNR) between the original and the lossy pixel-level visual content.

[Legend of the rate-accuracy plot (Fig. 3); x-axis: bitrate [KB/query].
CTA and HATC curves for Q = 5 (23.93 dB PSNR), Q = 10 (26.45 dB PSNR),
Q = 15 (27.84 dB PSNR), Q = 20 (28.80 dB PSNR), Q = 50 (32.10 dB PSNR)
and Q = 70 (38.80 dB PSNR); ATC curve.]
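
The operating points explored for the three paradigms can be enumerated with a few lines of Python. The parameter values are those listed above; the variable names and the dictionary layout are illustrative assumptions.

```python
from itertools import product

JPEG_QUALITY = [5, 10, 15, 20, 50, 70]                  # CTA and HATC base layer
BRISK_THRESHOLDS = [70, 75, 80, 85, 90, 95, 100, 105]   # ATC
REFINED_FEATURES = [25, 50, 100, 150]                   # HATC enhancement layer (Z)

cta_points = [{"method": "CTA", "Q": q} for q in JPEG_QUALITY]
atc_points = [{"method": "ATC", "threshold": t} for t in BRISK_THRESHOLDS]
# One HATC rate-accuracy curve per quality factor, one point per value of Z.
hatc_points = [{"method": "HATC", "Q": q, "Z": z}
               for q, z in product(JPEG_QUALITY, REFINED_FEATURES)]
```

This gives 6 CTA points, 8 ATC points and 24 HATC points (6 curves of 4 points each).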


4.4. Evaluation metrics 


We evaluate the performance in terms of rate-accuracy curves. In
particular, the accuracy of the task is evaluated according to the
Mean Average Precision (MAP) measure. Given an input query image
$\mathcal{I}_q$, it is possible to define the Average Precision as

$$ AP_q = \frac{\sum_{k=1}^{Z} P_q(k)\, r_q(k)}{R_q}, \qquad (6) $$

where $P_q(k)$ is the precision (i.e., the fraction of retrieved
documents that are relevant) considering the top-$k$ results in the
ranked list of database images; $r_q(k)$ is an indicator function,
which is equal to 1 if the item at rank $k$ is relevant for the query,
and zero otherwise; $R_q$ is the total number of relevant documents for
query $\mathcal{I}_q$ and $Z$ is the total number of documents in the
list. The overall Mean Average Precision for the whole set of query
images is computed as

$$ MAP = \frac{\sum_{q=1}^{Q} AP_q}{Q}, \qquad (7) $$

where $Q$ is the total number of queries.
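
Equations (6) and (7) translate directly into code. The following Python sketch computes the Average Precision from a ranked list of relevance flags and averages it over the queries; the function names and the two example ranked lists are hypothetical.

```python
def average_precision(relevance, num_relevant):
    """AP for one query.

    relevance[k-1] is 1 if the item at rank k is relevant, 0 otherwise;
    num_relevant is R_q, the total number of relevant documents."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k      # precision at rank k, counted only at relevant ranks
    return total / num_relevant

def mean_average_precision(queries):
    """queries: list of (relevance list, R_q) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

# Hypothetical ranked lists for two queries.
ap1 = average_precision([1, 0, 1, 0], num_relevant=2)   # (1/1 + 2/3) / 2
ap2 = average_precision([0, 1, 0, 0], num_relevant=1)   # (1/2) / 1
m = mean_average_precision([([1, 0, 1, 0], 2), ([0, 1, 0, 0], 1)])
```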

The quality of a JPEG coded image is evaluated according to its 
PSNR with respect to the uncompressed image. 
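
For reference, a minimal Python sketch of the PSNR computation is given below. It assumes the standard definition based on the mean squared error and 8-bit grayscale images with peak value 255; the paper does not spell out these details, and the function name and sample pixel values are illustrative.

```python
from math import log10

def psnr(original, decoded, peak=255):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    n = 0
    sq_err = 0
    for row_o, row_d in zip(original, decoded):
        for o, d in zip(row_o, row_d):
            sq_err += (o - d) ** 2
            n += 1
    if sq_err == 0:
        return float("inf")            # identical images
    mse = sq_err / n
    return 10 * log10(peak ** 2 / mse)

orig = [[100, 120], [140, 160]]
dec = [[101, 119], [141, 159]]     # every pixel off by one, so MSE = 1
value = psnr(orig, dec)            # 10 * log10(255^2), about 48.13 dB
```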


4.5. Results 

Figure 2 shows the feature coding efficiency achieved by the
differential encoding module (see Figure 1) as a function of the
distortion (PSNR) between the original image and the lossy one
reconstructed resorting to CTA. The lower the distortion (the higher
the PSNR), the more effective the HATC feature coding architecture.
Nonetheless, at high PSNR the features extracted from the lossy image
are already close to the original ones, and thus the accuracy increment
yielded by HATC is smaller.

Figure 3 compares the rate-accuracy performance of the three
approaches. For example, when 4 KB/query are allocated, CTA
achieves a MAP equal to 0.71. This value increases to 0.75 when using
HATC, trading off visual quality for accuracy (the PSNR decreases
from 26.4 dB to 23.9 dB). ATC achieves a slightly higher MAP (0.76),
but the pixel-domain content is not available at the decoder. A similar
analysis can be performed for different target bitrate budgets.
Figure 4 shows the MAP-PSNR trade-offs that are achievable when
targeting a given bitrate. When the available bitrate is equal to 3 KB
per query, a single working point corresponding to 0.66 MAP @ 24 dB
PSNR is achievable. At higher target bitrates (e.g. 4-7 KB/query),
it is possible to select a trade-off between MAP and PSNR by accurately
allocating the available bitrate to either the pixel-level or the
feature-level representations.

Fig. 3. Rate-accuracy curves comparing the performance of ATC,
CTA and HATC.

Fig. 4. Tradeoff between pixel-level distortion (PSNR) and visual
analysis task accuracy (MAP) obtained resorting to the HATC
architecture. Each curve refers to a target bitrate budget.
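
Selecting an operating point for a given bitrate budget amounts to filtering the available (bitrate, PSNR, MAP) triples and keeping the non-dominated ones. The Python sketch below illustrates this selection. The operating points are made-up placeholder values, not measurements from the experiments, and the function name `tradeoffs` is hypothetical.

```python
def tradeoffs(points, budget_kb):
    """Return the points within the bitrate budget that are not dominated
    in both PSNR and MAP by another feasible point."""
    feasible = [p for p in points if p["rate_kb"] <= budget_kb]
    front = []
    for p in feasible:
        dominated = any(
            q["psnr"] >= p["psnr"] and q["map"] >= p["map"]
            and (q["psnr"] > p["psnr"] or q["map"] > p["map"])
            for q in feasible
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p["psnr"])

# Placeholder operating points (illustrative values only).
points = [
    {"name": "A", "rate_kb": 3.9, "psnr": 24.0, "map": 0.80},
    {"name": "B", "rate_kb": 4.0, "psnr": 26.0, "map": 0.60},
    {"name": "C", "rate_kb": 3.5, "psnr": 26.0, "map": 0.50},
    {"name": "D", "rate_kb": 6.0, "psnr": 29.0, "map": 0.70},
]
front = tradeoffs(points, budget_kb=4.0)
names = [p["name"] for p in front]   # -> ["A", "B"]
```

Point D exceeds the budget and point C is dominated by B, so the remaining choices are A (higher accuracy) and B (higher image quality).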

5. CONCLUSIONS 

In this paper we propose Hybrid-Analyze-Then-Compress, an effective
paradigm tailored to distributed visual analysis tasks. Such a
model exploits a joint pixel- and local feature-level coding
architecture, leading to significant bitrate savings. Future work will
aim at improving the coding efficiency of both the keypoint location
and the descriptor enhancement layer modules and at extending the
approach to different classes of local features (e.g. SIFT, SURF
descriptors).